[UX][k8s] show-gpus for all allowed contexts #5362

kyuds · 2025-04-25T02:45:18Z

IMPORTANT! Feedback Please
I wasn't sure whether this would be desirable, but I think it would be a good idea to show a final table that merges all available accelerators across all k8s clusters. The user might only want to know the total availability across clusters, so this could be a helpful summary feature. For instance, for the example in the image above, we would have:

Kubernetes GPUs (all allowed contexts):
GPU: T4, TOTAL_GPUS: 2, TOTAL_FREE_GPUS: 2
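The merged table amounts to summing per-GPU counts across contexts. The following is a minimal sketch, assuming each context's availability has already been fetched as (gpu, total, free) rows. The function name, data shape, and context names are hypothetical and are not SkyPilot's actual API. The example splits the two T4s across two made-up contexts purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical shape: context name -> list of (gpu_name, total, free) rows.
PerContext = Dict[str, List[Tuple[str, int, int]]]


def aggregate_gpus(per_context: PerContext) -> Dict[str, Tuple[int, int]]:
    """Sum total and free GPU counts per GPU type across all contexts."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for rows in per_context.values():
        for gpu, total, free in rows:
            totals[gpu][0] += total
            totals[gpu][1] += free
    # Sort by GPU name so the summary table has a stable order.
    return {gpu: (t, f) for gpu, (t, f) in sorted(totals.items())}


if __name__ == '__main__':
    example = {
        'ctx-a': [('T4', 1, 1)],  # hypothetical context
        'ctx-b': [('T4', 1, 1)],  # hypothetical context
    }
    print('Kubernetes GPUs (all allowed contexts):')
    for gpu, (total, free) in aggregate_gpus(example).items():
        print(f'GPU: {gpu}, TOTAL_GPUS: {total}, TOTAL_FREE_GPUS: {free}')
```

Run as a script, this prints the same summary line shown above for T4.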

EDIT (25.04.27): implemented the above feature as suggested in the comment below.

Some Behavioral Changes:

Specifying an accelerator and the Kubernetes context (aka region) at the same time ignored the Kubernetes context. Fixed this so that the k8s context is taken into account.
In kubernetes_catalog._list_accelerators(), the function was adding accelerator names to the dict and setting their availability to 0. I don't know whether this is a real requirement (feedback would be great), but it causes an assertion error when the key-set sizes of the accelerator counts, capacity, and available dicts are checked together with a quantity filter. Therefore, I added checks so that accelerator names are NOT added to the dict when capacity is 0.
Added a "total table" feature so that users can see the cumulative sum of GPUs across their registered k8s clusters in allowed contexts.
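The key-set invariant behind the second change can be illustrated with a small sketch. Everything here is hypothetical: the function name, the per-node (name, gpus_on_node, free_on_node) row shape, and the quantity-filter semantics. It is not the actual code in kubernetes_catalog. The idea is that a name is written to all three dicts at one place, after every filter, so a zero-capacity or filtered-out accelerator never appears in only some of them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-node row: (accelerator_name, gpus_on_node, free_on_node).
NodeRow = Tuple[str, int, int]


def list_accelerators(
    rows: List[NodeRow],
    quantity_filter: Optional[int] = None,
) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]:
    """Build counts/capacity/available dicts that always share one key set."""
    counts: Dict[str, List[int]] = {}
    capacity: Dict[str, int] = {}
    available: Dict[str, int] = {}
    for name, gpus_on_node, free_on_node in rows:
        # Skip zero-capacity accelerators instead of registering them
        # with availability 0 in only some of the dicts.
        if gpus_on_node == 0:
            continue
        # Hypothetical quantity filter: only nodes with enough GPUs count.
        if quantity_filter is not None and gpus_on_node < quantity_filter:
            continue
        # All three dicts are updated together, so their keys stay aligned.
        counts.setdefault(name, []).append(gpus_on_node)
        capacity[name] = capacity.get(name, 0) + gpus_on_node
        available[name] = available.get(name, 0) + free_on_node
    assert counts.keys() == capacity.keys() == available.keys()
    return counts, capacity, available
```

Because the assertion compares key sets rather than only sizes, it also catches dicts that have the same length but different accelerator names.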

Tested (run the relevant ones):

Code formatting: install pre-commit (auto-check on commit) or bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

SeungjinYang · 2025-04-25T17:24:40Z

Re: UI, I do agree on having a table showing aggregated GPU availability across all clusters.

I actually think such a table should be at the top: the current UI places per-context GPU availability above node-level availability, so information already flows from more general to more specific, top to bottom.

kyuds · 2025-04-27T05:12:35Z

> Re: UI, I do agree on having a table showing aggregated GPU availability across all clusters.
>
> I actually think such table should be at the top, because the current UI places per-context GPU availability above node level availability, so it already has a flow of information going from more general -> more specific from top to bottom.

implemented!


SeungjinYang

Thanks @kyuds!

Michaelvll · 2025-05-02T23:14:37Z

I am trying the latest master with this PR merged. The UX for a single k8s cluster with GPUs seems a bit weird to me:

Should we remove the trailing \n\n?
Should we remove the --- at the end of the table?
The color for "Kubernetes per node accelerator availability" does not match the color scheme in the rest of our UX. See the original UX below:
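The first two points come down to how the rendered tables are concatenated. The following is a minimal sketch, assuming each table has already been rendered to a string. render_sections is a hypothetical helper, not the actual CLI code. str.join only inserts the separator between items, so nothing trails the final table.

```python
from typing import List


def render_sections(sections: List[str]) -> str:
    """Join pre-rendered tables with one blank line between them."""
    # Drop empty sections and strip trailing newlines so that no
    # doubled blank lines appear between or after tables.
    non_empty = [s.rstrip('\n') for s in sections if s.strip()]
    # The separator goes only between items: nothing after the last table.
    return '\n\n'.join(non_empty)
```

The same approach applies to a --- divider: pass it as the join separator instead of appending it to every table.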

concretevitamin · 2025-05-02T23:18:07Z

A minor addition to the above: it would be good to replace None with -.

SeungjinYang · 2025-05-03T00:26:06Z

I actually think that if a node doesn't contain any GPUs, then it shouldn't show up in the table.
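This suggestion and the earlier None -> - one could be handled together when building node rows. The following is a minimal sketch with a hypothetical (node, gpu, total, free) row shape. node_table_rows and fmt are illustrative names, not SkyPilot functions.

```python
from typing import List, Optional, Tuple

# Hypothetical node row: (node_name, gpu_name, total_gpus, free_gpus).
NodeRow = Tuple[str, Optional[str], int, Optional[int]]


def fmt(value: Optional[object]) -> str:
    """Render missing values as '-' instead of 'None'."""
    return '-' if value is None else str(value)


def node_table_rows(rows: List[NodeRow]) -> List[List[str]]:
    """Keep only nodes that actually have GPUs, formatted for display."""
    table = []
    for node, gpu, total, free in rows:
        if gpu is None or total == 0:
            continue  # CPU-only node: leave it out of the GPU table.
        table.append([node, gpu, fmt(total), fmt(free)])
    return table
```

A value of 0 is still shown as 0. Only a truly missing value becomes -.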

Michaelvll · 2025-05-03T00:50:32Z

This PR also seems to cause a backward compatibility issue, because the return value of the request to the API server has changed.

_get_kubernetes_realtime_gpu_table
    gpu_availability = models.RealtimeGpuAvailability(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: RealtimeGpuAvailability.__new__() missing 2 required positional arguments: 'capacity' and 'available'
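The traceback shows that capacity and available became required fields of RealtimeGpuAvailability. It appears that a response lacking those fields, such as one from an older API server, can no longer be turned into the model. One common way to keep a NamedTuple backward compatible is to give the new fields defaults. This is a sketch, not necessarily the fix adopted in #5488. The gpu and counts field names and their types are assumptions.

```python
from typing import List, NamedTuple, Optional


class RealtimeGpuAvailability(NamedTuple):
    """Sketch of a backward-compatible model (field names partly assumed)."""
    gpu: str  # assumed existing field
    counts: List[int]  # assumed existing field
    # New fields get defaults so that shorter payloads from older
    # servers still construct cleanly instead of raising TypeError.
    capacity: Optional[int] = None
    available: Optional[int] = None
```

Defaults only cover a newer client receiving the older, shorter payload. An older client receiving the longer tuple would still fail with too many arguments, so that direction needs separate handling, for example a version check.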

Michaelvll · 2025-05-03T04:44:22Z

Another piece of UX feedback:

It seems we currently show:

Total table
Context 1 GPUs
Context 1 Nodes

Context 2 GPUs
Context 2 Nodes
...

For SkyPilot users, GPUs are much more important than node information. A better layout would be:

Total table
Context 1 GPUs
Context 2 GPUs

# We can even combine the context 1 and context 2 nodes in a single table
Context 1 Nodes
Context 2 Nodes
...
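The proposed ordering can be expressed as a flat list of sections. The following is a minimal sketch, assuming each table is already rendered to a string and keyed by context name. order_sections is hypothetical and does not combine the node tables into one, which the comment above leaves as an option.

```python
from typing import Dict, List


def order_sections(total_table: str,
                   gpu_tables: Dict[str, str],
                   node_tables: Dict[str, str]) -> List[str]:
    """Order output as: total, every context's GPUs, then every context's nodes."""
    sections = [total_table]
    # All per-context GPU tables first, since users care most about GPUs.
    sections += [f'Context {ctx} GPUs\n{t}' for ctx, t in gpu_tables.items()]
    # Node-level detail comes last.
    sections += [f'Context {ctx} Nodes\n{t}' for ctx, t in node_tables.items()]
    return sections
```

This keeps the general-to-specific flow discussed earlier while grouping all GPU information before any node information.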

preliminary implementation

6ef62f9

kyuds marked this pull request as draft April 25, 2025 02:45

fix errors

c9e7964

kyuds marked this pull request as ready for review April 25, 2025 04:33

Michaelvll requested review from romilbhardwaj and SeungjinYang April 25, 2025 23:41

add total table for k8s gpus

8f96ad8

SeungjinYang reviewed Apr 28, 2025


sky/cli.py (outdated, resolved)

sky/cli.py (resolved)

sky/clouds/service_catalog/kubernetes_catalog.py (outdated, resolved)

sky/core.py (outdated, resolved)

resolve comments

f649a70

kyuds requested a review from SeungjinYang April 29, 2025 00:16

SeungjinYang reviewed Apr 29, 2025


sky/cli.py (resolved)

Update sky/cli.py

08f42b3

Co-authored-by: Seung Jin <seungjin219@gmail.com>

kyuds requested a review from SeungjinYang April 29, 2025 00:29

SeungjinYang approved these changes Apr 29, 2025


SeungjinYang merged commit 5fb3933 into skypilot-org:master Apr 29, 2025
22 checks passed

kyuds deleted the k8s/gpus branch April 29, 2025 01:17

zpoint mentioned this pull request Apr 30, 2025

Fix failure of test_kubernetes_context_failover #5455

Merged


kyuds mentioned this pull request May 3, 2025

[UX][k8s] backwards compatibility for k8s show-gpus #5488

Merged

